[students murmuring]

- Okay, good afternoon everyone, let's get started. For those of you who I haven't met yet, my name is Serena Yeung, and I'm the third and final instructor for this class. I'm also a PhD student in Fei-Fei's group. Today we're going to talk about backpropagation and neural networks, so now we're really starting to get to some of the core material in this class.

Before we begin, a few administrative details. Assignment one is due Thursday, April 20th; as a reminder, we shifted the date back a little bit, and it's going to be due at 11:59 p.m. on Canvas.

You should also start thinking about your projects. There are TA specialties listed on Piazza, so if you have questions about a specific project topic you're considering, you can go find the TAs that might be most relevant. And for Google Cloud: all students are going to get $100 in credits to use for their assignments and project, so you should be receiving an email for that this week. A lot of you may have already; for those who haven't, it should come by the end of this week.

Okay, so where we are. So far we've talked about how to define a classifier using a function f, parameterized by weights W. This function f takes data x as input and outputs a vector of scores, one for each of the classes we want to classify. From there we can also define a loss function, for example the SVM loss we've talked about, which quantifies how happy or unhappy we are with the scores we've produced, and we can use that to define a total loss term L, which combines this data term with a regularization term expressing how simple our model is; we have a preference for simpler models, for better generalization. And so now we want to find the parameters W that correspond to the lowest loss, right?
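To make this setup concrete, here's a minimal numpy sketch of the pieces so far; the function and variable names are illustrative, not anything fixed by the lecture.

    import numpy as np

    def total_loss(W, x, y, reg=0.1):
        """Multiclass SVM loss for one example x with true label y,
        plus an L2 regularization term on the weights W."""
        scores = W.dot(x)                                # f(x; W) = Wx
        margins = np.maximum(0, scores - scores[y] + 1)  # hinge: max(0, s_j - s_y + 1)
        margins[y] = 0                                   # don't count the correct class
        data_loss = np.sum(margins)                      # the data term L_i
        reg_loss = reg * np.sum(W * W)                   # R(W): preference for simple models
        return data_loss + reg_loss                      # total loss L

    # toy example: 3 classes, 4-dimensional input
    W = np.random.randn(3, 4) * 0.01
    x = np.random.randn(4)
    print(total_loss(W, x, y=1))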
We want to minimize the loss function, and to do that we want to find the gradient of L with respect to W.

Last lecture we talked about how we can do this using optimization: we iteratively take steps in the direction of steepest descent, which is the negative of the gradient, in order to walk down the loss landscape and get to the point of lowest loss. And we saw how gradient descent can take a trajectory like the one in the image on the right, getting to the bottom of your loss landscape.

We also talked about different ways of computing a gradient. We can compute it numerically, using a finite difference approximation, which is slow and approximate, but at the same time it's really easy to write out, and you can always get the gradient this way. We also talked about using the analytic gradient, which is fast and exact once you've derived the expression for it, but you have to do all the math and calculus to derive it, so it's also easy to make mistakes. So in practice what we want to do is derive the analytic gradient and use that, but at the same time check our implementation against the numerical gradient, to make sure we've gotten all of our math right; there's a small sketch of this check just below.
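Here's that gradient-check idea as a small sketch, assuming a scalar-valued function of a numpy array; the helper name is my own.

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        """Centered finite-difference estimate of df/dx, one coordinate
        at a time: slow and approximate, but easy to get right."""
        grad = np.zeros_like(x)
        for i in range(x.size):
            old = x.flat[i]
            x.flat[i] = old + h
            fp = f(x)
            x.flat[i] = old - h
            fm = f(x)
            x.flat[i] = old                      # restore the value
            grad.flat[i] = (fp - fm) / (2 * h)
        return grad

    # check an analytic gradient: f(x) = sum(x^2) has df/dx = 2x
    x = np.random.randn(5)
    analytic = 2 * x
    numeric = numerical_gradient(lambda v: np.sum(v**2), x)
    print(np.max(np.abs(analytic - numeric)))    # should be tiny, around 1e-9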
So today we're going to talk about how to compute the analytic gradient for arbitrarily complex functions, using a framework that I'm going to call computational graphs. A computational graph is a graph we can use to represent any function, where the nodes of the graph are the steps of computation that we go through. For example, for the linear classifier we've talked about, the inputs are x and W, and this multiplication node represents the matrix multiplication of the parameters W with our data x, outputting our vector of scores. Then we have another computational node which represents our hinge loss, computing our data loss term Li, and we also have a node at the bottom right which computes our regularization term, and then our total loss L at the end is the sum of the regularization term and the data term.

The advantage is that once we can express a function using a computational graph, we can use a technique called backpropagation, which recursively applies the chain rule in order to compute the gradient with respect to every variable in the computational graph, and we're going to see how this is done. This becomes very useful when we start working with really complex functions, for example the convolutional neural networks we'll talk about later in this class: we have the input image at the top, our loss at the bottom, and the input has to go through many layers of transformations in order to get all the way down to the loss function.

And this can get even crazier with something like a Neural Turing Machine, which is another kind of deep learning model. In that case the computational graph is really insane, especially once we unroll it over time, and it would be basically completely impractical to derive the gradient expressions by hand for any of these intermediate variables.

Okay, so how does backpropagation work? We're going to start off with a simple example, where again our goal is that we have a function, in this case f of x, y, z equals x plus y, all multiplied by z, and we want to find the gradients of the output of the function with respect to any of the variables. The first step, always, is to take our function f and represent it using a computational graph.
Here our computational graph is on the right, and you can see that first we have the plus node, x plus y, and then we have the multiplication node for the second computation that we're doing. And now we're going to do a forward pass of this graph: given the values of the variables that we have, x equals negative two, y equals five, and z equals negative four, I'm going to fill these all in in our computational graph. Here we can compute an intermediate value, x plus y gives three, and then finally we pass it through the last node, the multiplication, to get our final value of f equals negative 12.

So here we want to give every intermediate variable a name. I've called the intermediate variable after the plus node q, so we have q equals x plus y, and then f equals q times z, using this intermediate variable. I've also written out here the gradients of q with respect to x and y, which are both just one because of the addition, and the gradients of f with respect to q and z, which are z and q respectively, because of the multiplication rule. And what we want to find is the gradients of f with respect to x, y and z.

So backprop is a recursive application of the chain rule. We're going to start at the back, the very end of the computational graph, and work our way backwards, computing all the gradients along the way. If we start at the very end, we want the gradient of the output with respect to the last variable, which is just f, and this gradient is just one; it's trivial.

Now, moving backwards, we want the gradient with respect to z, and we know that df over dz is equal to q. The value of q is just three, and so we have df over dz equals three.

And next, if we want df over dq, what is the value of that? What is df over dq?
So we have that df over dq is equal to z, and the value of z is negative four, so df over dq equals negative four.

Okay, so now continuing to move backwards through the graph, we want to find df over dy. But in this case y is not connected directly to f, right? It's connected through the intermediate variable q. So the way we're going to do this is to leverage the chain rule, which says that df over dy can be written as df over dq, times dq over dy. The intuition is that in order to find the effect of y on f, we can take the effect of q on f, which we already know, df over dq equals negative four, and compound it with the effect of y on q, which is dq over dy.

So what's dq over dy equal to in this case?

- [Student] One.

- One, right, exactly. So dq over dy is equal to one, which means that if we change y by a little bit, q is going to change by approximately the same amount. So what this is saying is: if I change y by a little bit, the effect of y on q is going to be one, and then the effect of q on f is going to be approximately a factor of negative four. We multiply these together, and we get that the effect of y on f is going to be negative four.

Okay, so now if we want to do the same thing for the gradient with respect to x, we can follow the same procedure. So what is this going to be?

[students speaking away from microphone]

- I heard "the same." Yeah, exactly. In this case we again apply the chain rule: we know the effect of q on f is negative four, and since we have the same addition node, dq over dx is equal to one again, so we have negative four times one, and the gradient with respect to x is going to be negative four.
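Here's that whole worked example as a few lines of Python, as a sketch: the forward pass first, then the backward pass applying the chain rule from the output back.

    # forward pass for f(x, y, z) = (x + y) * z
    x, y, z = -2.0, 5.0, -4.0
    q = x + y                    # q = 3
    f = q * z                    # f = -12

    # backward pass: chain rule, starting from the output
    df_df = 1.0                  # gradient of f with respect to itself
    df_dz = q * df_df            # df/dz = q              -> 3
    df_dq = z * df_df            # df/dq = z              -> -4
    df_dy = 1.0 * df_dq          # dq/dy = 1, chain rule  -> -4
    df_dx = 1.0 * df_dq          # dq/dx = 1, chain rule  -> -4
    print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0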
Okay, so what we're doing in backprop is that we have all of these nodes in our computational graph, but each node is only aware of its immediate surroundings. At each node we have the local inputs connected to it, the values flowing into the node, and then we also have the output that is directly produced by this node. So here our local inputs are x and y, and the output is z.

And at this node we also know the local gradients: we can compute the gradient of z with respect to x, and the gradient of z with respect to y, and these are usually really simple operations. Each node is going to be something like the addition or the multiplication that we had in that earlier example, something where we can just write down the gradient without having to go through very complex calculus to find it.

- [Student] Can you go back and explain how what you did in the last slide is different from just deriving it using normal calculus?

- Yeah, so basically, if we go back, hold on, let me...
So if we go back here, we could find all of these exactly by writing them out using just calculus. We could say we want df over dx, and we could expand out this expression and see that it's just going to be z. We can do this here because it's simple, but we'll see examples later on where, once this becomes a really complicated expression, you don't want to have to use calculus to derive the gradient of a super-complicated expression. Instead, if you use this formalism and break it down into computational nodes, then you only ever work with gradients of very simple computations, at the level of additions, multiplications, exponentials, things as simple as you want them to be, and then you just use the chain rule to multiply all these together and get the value of your gradient, without ever having to derive the gradient of the entire expression.

Does that make sense?

[student murmuring]

Okay, so we'll see an example of this later.

And so, was there another question, yeah?

[student speaking away from microphone]

- [Student] What's the negative four next to the z representing?

- Okay, yeah, so the negative four: the green values on top were all the values of the function as we passed it forward through the computational graph. So we said up here that x is equal to negative two, y is equal to five, and z equals negative four, so we filled in all of these values, and then we just wanted to compute the value of the function. This value of q is going to be x plus y, negative two plus five, which is three, and we have z equal to negative four, so we fill that in here, and then we multiply q and z together, three times negative four, in order to get the final value of f. And then the red values underneath are the gradients that we filled in as we worked backwards.

Okay.
Okay, so right, we said that we have these nodes, and each node gets its local inputs coming in and the output that it directly passes on to the next node, and we also have these local gradients that we computed, the gradient of the immediate output of the node with respect to the inputs coming in.

And what happens during backprop is that we start from the back of the graph and work our way from the end all the way back to the beginning, and when we reach each node, we have the upstream gradient coming back with respect to the immediate output of the node. So by the time we reach this node in backprop, we've already computed the gradient of our final loss L with respect to z. And what we want to find next is the gradients with respect to the values just before the node, x and y.

As we saw earlier, we do this using the chain rule: the gradient of this loss function with respect to x is going to be the gradient with respect to z, compounded with the local gradient of z with respect to x. So in the chain rule we always take the upstream gradient coming down and multiply it by the local gradient, in order to get the gradient with respect to the input.

- [Student] So, sorry, is it different because this would never work to get a general symbolic formula for the gradient? It only works with instantaneous values, where you're passing a little numerical value rather than something symbolic?

- So the question is whether this only works because we're working with the current values of the function. And it works given the current values of the function that we plug in, but we can still write an expression for this in terms of the variables, right?
So we'll see that the gradient of L with respect to z is going to be some expression, and the gradient of z with respect to x is going to be another expression, and we plug in the values of these numbers at the time in order to get the value of the gradient with respect to x. And what you could do is recursively plug in all of these expressions. The gradient of z with respect to x is going to be a simple expression; in this case, if we have a multiplication node, the gradient of z with respect to x is just y, we know that. But the gradient of L with respect to z, that's probably a complex part of the graph in itself, so that's where we want to just have the numerical value: basically this is going to be just a number coming down, a value, and then we just multiply it with the expression that we have for the local gradient. And I think this will be more clear when we go through a more complicated example in a few slides.

Okay, so for the gradient of L with respect to y we have exactly the same idea: again we use the chain rule, the gradient of L with respect to z times the gradient of z with respect to y, multiply these together and get our gradient.

And then once we have these, we pass them on to the nodes directly connected before this node. And the main thing to take away from this is that at each node we just want to have our local gradient that we compute and keep track of, and then during backprop, as we receive numerical values of gradients coming from upstream, we just take that value, multiply it by the local gradient, and that's what we then send back to the connected nodes going backwards, without having to care about anything else besides these immediate surroundings.
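One way to picture this locality in code is as a little gate object, as in this sketch (my own minimal framing, not any particular library's API): on the forward pass each gate caches its inputs, and on the backward pass it multiplies the upstream gradient by its local gradients and passes the results back.

    class MultiplyGate:
        """Sketch of one computational node: z = x * y."""
        def forward(self, x, y):
            self.x, self.y = x, y     # cache the local inputs for backprop
            return x * y
        def backward(self, dz):       # dz: upstream gradient dL/dz
            dx = self.y * dz          # local gradient dz/dx = y, times upstream
            dy = self.x * dz          # local gradient dz/dy = x, times upstream
            return dx, dy

    gate = MultiplyGate()
    z = gate.forward(3.0, -4.0)       # forward pass
    dx, dy = gate.backward(1.0)       # pretend the upstream gradient dL/dz is 1
    print(z, dx, dy)                  # -12.0 -4.0 3.0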
So now we're going to go through another example, this time a little bit more complex, so we can see more of why backprop is so useful. In this case our function is f of w and x, which is equal to one over one plus e to the negative of the quantity w-zero times x-zero, plus w-one times x-one, plus w-two. So again, the first step always is to write this out as a computational graph. In this graph, first we multiply together the w and x terms that we have, w-zero with x-zero and w-one with x-one, and then we add all of these together along with w-two. Then we scale it by negative one, we take the exponential, we add one, and then finally we take one over this whole term.

And here I've also filled in values, so given the values that we have for the ws and xs, we can make a forward pass and compute the value at every stage of the computation. I've also written down here at the bottom the expressions for some derivatives that are going to be helpful later on, the same as we did before with the simple example.

Okay, so now we're going to do backprop through here. Again, we start at the very end of the graph, and again the gradient of the output with respect to the last variable is just one, trivially. So now, moving backwards one step, what's the gradient with respect to the input just before the one-over-x node? Well, in this case we know that the upstream gradient coming down is this one, in red; that's the upstream gradient flowing down, and now we need to find the local gradient. The local gradient of this node, f of x equals one over x, which we have here in red, is df over dx equals negative one over x-squared.
So here we're going to take negative one over x-squared and plug in the value of x that we had during the forward pass, 1.37, and so our final gradient with respect to this variable is going to be negative one over 1.37 squared, times one, which equals negative 0.53.

Moving back to the next node, we go through the exact same process. Here the gradient flowing from upstream is negative 0.53, and the node here is a "plus one," so looking at our reference of derivatives at the bottom, for a constant plus x the local gradient is just one. So what's the gradient with respect to this variable, using the chain rule?

It's going to be the upstream gradient of negative 0.53 times our local gradient of one, which is equal to negative 0.53.

So let's keep moving backwards one more step. Here we have the exponential. What's the upstream gradient coming down?

[student speaking away from microphone]

Right, the upstream gradient is negative 0.53. And what's the local gradient here? It's going to be the local gradient of e to the x, since this is an exponential node, and so the chain rule tells us that our gradient is going to be negative 0.53 times e to the power of x, which in this case is e to the negative one from our forward pass, and this is going to give us our final gradient of negative 0.2.

Okay, so now one more node here: the next node we reach is going to be a multiplication by negative one. So here, what's the upstream gradient coming down?

- [Student] Negative 0.2?

- [Serena] Negative 0.2, right. And what's going to be the local gradient? You can look at the reference sheet. It's going to be, what was it? I think I heard it.

- [Student] That's minus one?

- It's going to be minus one, exactly, yeah, because our local gradient for a times x is df over dx equals a, and the value of a that we scaled x by is negative one here.
So we have here that the gradient is negative one times negative 0.2, and so our gradient is 0.2.

Okay, so now we've reached an addition node, and in this case we have these two branches both connected to it. What's the upstream gradient here? It's going to be 0.2, just as before. And now for the gradient with respect to each of these branches: it's an addition, and we saw from before in our simple example that when we have an addition node, the local gradient with respect to each of the inputs to the addition is just going to be one. So here, for our top branch, the local gradient of one times the upstream gradient of 0.2 gives a total gradient of 0.2, and for our bottom branch we do the same thing: our upstream gradient is 0.2, our local gradient is one again, and the total gradient is 0.2.

So is everything clear about this? Okay.

So we have a few more gradients to fill out. Moving back, we've now reached w-zero and x-zero, and here we have a multiplication node. We saw the multiplication node from before: the gradient with respect to one of the inputs is just the value of the other input. So in this case, what's the gradient with respect to w-zero?

- [Student] Minus 0.2.

- I'm hearing minus 0.2, exactly. Yeah, so with respect to w-zero we have our upstream gradient, 0.2, times the value of x-zero, which is negative one, and we get negative 0.2. And we can do the same thing for the gradient with respect to x-zero: it's going to be 0.2 times the value of w-zero, which is two, and we get 0.4.

Okay, so here we've filled out most of these gradients.
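Here's this whole circuit as a code sketch, with the forward pass staged exactly like the graph and one chain-rule line per node on the way back. The values w0 = 2 and x0 = -1 appear in the lecture; the rest (w1 = -3, x1 = -2, w2 = -3) are my assumption, chosen so the forward pass reproduces the numbers above (sum = 1, output = 0.73).

    import math

    # forward pass, one stage per node in the graph
    w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0
    dot = w0*x0 + w1*x1 + w2      # multiply and add gates    ->  1.0
    neg = -1.0 * dot              # scale by negative one     -> -1.0
    e   = math.exp(neg)           # exponential               ->  0.37
    den = 1.0 + e                 # plus-one node             ->  1.37
    f   = 1.0 / den               # one-over-x node           ->  0.73

    # backward pass: local gradient times upstream, node by node
    dden = (-1.0 / den**2) * 1.0  # d(1/x)/dx = -1/x^2        -> -0.53
    de   = 1.0 * dden             # d(1 + x)/dx = 1           -> -0.53
    dneg = math.exp(neg) * de     # d(e^x)/dx = e^x           -> -0.20
    ddot = -1.0 * dneg            # d(-x)/dx = -1             ->  0.20
    dw0, dx0 = x0 * ddot, w0 * ddot   # multiply gate         -> -0.20, 0.39
    dw1, dx1 = x1 * ddot, w1 * ddot   # multiply gate         -> -0.39, -0.59
    dw2 = 1.0 * ddot                  # add gate distributes  ->  0.20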
And there was the question earlier about why this is simpler than deriving the analytic gradient, the expression with respect to any of these variables. You can see here that all we ever dealt with were expressions for local gradients that we had to write out; once we had those expressions, all we did was plug in the values for each of them, and use the chain rule to numerically multiply this all the way backwards and get the gradients with respect to all of the variables.

And we can also fill out the gradients with respect to w-one and x-one here in exactly the same way. One thing that I want to note is that when we're creating these computational graphs, we can define the computational nodes at any granularity that we want. In this case we broke it down into the absolute simplest nodes that we could, additions and multiplications; it basically can't get any simpler than that. But in practice we can group some of these nodes together into more complex nodes if we want, as long as we're able to write down the local gradient for that node.

As an example, look at the sigmoid function, which I've defined in the upper right here: sigmoid of x is equal to one over one plus e to the negative x. This is a really common function that you'll see a lot in the rest of this class, and we can compute the gradient for it. If we actually go through the math of doing this analytically, we get a nice expression at the end: it's equal to one minus sigma of x, so one minus the output of this function, times sigma of x. And so in cases where we have something like this, we could just take all the computations in our graph that made up this sigmoid, and replace them with one big node that's a sigmoid, because we do know the local gradient for this gate: it's this expression, d sigma of x over dx.
So basically the important thing here is that you can group nodes together into somewhat more complex nodes however you want, as long as you can write down the local gradient for them. And all this is is a trade-off between how much math you want to do in order to get a more concise and simpler graph, versus how simple you want each of your local gradients to be. You can write out as complex a computational graph as you want.

Yeah, question?

- [Student] This is a question on the graph itself: is there a reason that the first two multiplication nodes and the weights are not connected to a single addition node?

- So they could also be connected into a single addition node. The question was: is there a reason why w-zero times x-zero and w-one times x-one are not connected with w-two, with all of these additions just connected together? And the answer is that you can do that if you want, and in practice maybe you would actually want to, because that's still a very simple node. In this case I just wrote this out as simply as possible, where each node only had up to two inputs, but yeah, you could definitely do that.

Any other questions about this?

Okay, so the one thing that I really like about thinking about this as a computational graph is that I feel very comforted: any time I have to find gradients of something, even if the expression I want to compute gradients of is really hairy and really scary, whether it's something like this sigmoid or something worse, I know that I could derive it if I wanted to, but really, if I just sit down and write it out as a computational graph, I can go as simple as I need to, to always be able to apply backprop and the chain rule, and be able to compute all the gradients that I need.
And this is something that you should think about when you're doing your homeworks: basically, any time you're having trouble finding gradients of something, just think about it as a computational graph, break it down into all of these parts, and then use the chain rule.

Okay, so we talked about how we could group this set of nodes together into a sigmoid gate, and just to confirm that this is actually exactly equivalent, we can plug it in. We have that our input here to the sigmoid gate is going to be one, in green, and the output is going to be 0.73; this works out if you plug it into the sigmoid function. And now if we want to take the gradient and treat this entire sigmoid as one node, what we should do is use the local gradient that we derived up here: one minus sigmoid of x, times sigmoid of x. We know that the value of sigmoid of x was 0.73, so if we plug that value in, we'll see that the value of this local gradient is equal to 0.2. We multiply it by the upstream gradient, which is one, and we get out exactly the same value of the gradient with respect to the input of the sigmoid gate as when we broke it down into all of the smaller computations.
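As a sketch of this grouped-node version in code, here's a sigmoid gate that uses the derived local gradient, one minus sigma of x times sigma of x, in place of the four smaller nodes (the class name is illustrative).

    import math

    class SigmoidGate:
        """Sketch: the *(-1), exp, +1, and 1/x nodes collapsed into one,
        using the local gradient d(sigma)/dx = (1 - sigma(x)) * sigma(x)."""
        def forward(self, x):
            self.out = 1.0 / (1.0 + math.exp(-x))  # cache the output
            return self.out
        def backward(self, dout):
            return (1.0 - self.out) * self.out * dout

    gate = SigmoidGate()
    print(gate.forward(1.0))    # ~0.73, matching the forward pass above
    print(gate.backward(1.0))   # ~0.20, the same gradient as the step-by-step version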
Okay, and as we look at what happens as we take these gradients backwards through our computational graph, there are some patterns that you'll notice, with intuitive interpretations we can give them. So we saw that the add gate is a gradient distributor: when we passed through the addition gate, which had two branches coming into it, it took the upstream gradient and just distributed it, passing the exact same value to both of the connected branches.

Here are a couple more that we can think about. So what does a max gate look like? We have a max gate here at the bottom, where the inputs coming in are z and w: z has a value of two, w has a value of negative one, and we took the max of these, which is two, and passed it down into the remainder of our computational graph. So now, taking the gradients with respect to this, the upstream gradient coming back is, let's say, two. And what does this local gradient look like?

So, anyone, yes?

- [Student] It'll be zero for one, and one for the other?

- Right.

[student speaking away from microphone]

Exactly. So the answer that was given is that z will have a gradient of two, and w will have a gradient of zero. One of the inputs is going to get the full value of the gradient just passed back, routed to that variable, and the other one will have a gradient of zero. So we can think of this as a gradient router: whereas the addition node passed back the same gradient to both incoming branches, the max gate will just take the gradient and route it to one of the branches. And this makes sense, because if we look at our forward pass, only the value that was the maximum got passed down to the rest of the computational graph; it's the only value that actually affected our function computation at the end. So it makes sense that when we're passing our gradients back, we just want to flow them through that branch of the computation.

Okay, and so another one: the multiplication gate, which we saw earlier. Is there any interpretation of this?

[student speaking away from microphone]

Okay, so the answer that was given is that the local gradient is basically just the value of the other variable, and that's exactly right. So we can think of this as a gradient switcher, or I guess a scaler, where we take the upstream gradient and we scale it by the value of the other branch.
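Here are those three patterns as a sketch in code, framed as one backward function per gate (my own minimal framing):

    def add_backward(dout):
        """Gradient distributor: both inputs get the upstream gradient."""
        return dout, dout

    def max_backward(x, y, dout):
        """Gradient router: only the input that was the max gets the gradient."""
        return (dout, 0.0) if x > y else (0.0, dout)

    def mul_backward(x, y, dout):
        """Gradient switcher/scaler: each input gets the other's value, times upstream."""
        return y * dout, x * dout

    print(max_backward(2.0, -1.0, 2.0))   # (2.0, 0.0): z gets the gradient, w gets zero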
Okay, and so one other thing to note: when we have a place where one node is connected to multiple nodes, the gradients add up at this node. At these branches, using the multivariate chain rule, we just take the value of the upstream gradient coming back from each of these nodes, and we add them together to get the total upstream gradient flowing back into this node. You can see this from the multivariate chain rule, and you can also think about it this way: if you change this node a little bit, it's going to affect both of the connected nodes in the forward pass, when you're making your forward pass through the graph. And so when you're doing backprop, both of these gradients coming back are going to affect this node, and that's why we sum them up into the total upstream gradient flowing back into this node.

Okay, so any questions about backprop, going through these forward and backward passes?

- [Student] So we haven't done anything to actually update the weights. [speaking away from microphone]

- Right, so the question is that we haven't done anything yet to update the values of these weights; we've only found the gradients with respect to the variables. That's exactly right. So what we've talked about so far in this lecture is how to compute gradients with respect to any variables in our function, and once we have these, we can just apply everything we learned in the optimization lecture, last lecture. Given the gradient, we now take a step in the direction of the negative gradient in order to update our weights, our parameters.
So you can just take this entire framework that we learned about last lecture for optimization, and what we've done here is learn how to compute the gradients we need for arbitrarily complex functions, and this is going to be useful when we talk about complex functions like neural networks later on.

Yeah?

- [Student] Do you mind writing out all the variables, so you could help explain this slide a little better?

- Yeah, so I can write this maybe on the board. So basically, if we're going to have the gradient of f with respect to some variable x, and let's say x is connected to f through intermediate variables q-i, we can write df over dx as the sum over i of df over dq-i, times dq-i over dx. So this is basically saying that if x is connected to these multiple elements, in this case the different q-is, then the chain rule takes the effect of each of these intermediate variables on our final output f, and compounds each one with the local effect of our variable x on that intermediate value. So yeah, it's basically just summing all of these up together.

Okay, so now that we've done all these examples in the scalar case, we're going to look at what happens when we have vectors. So now our variables x, y and z, instead of just being numbers, are vectors. And everything stays exactly the same, the entire flow; the only difference is that now our gradients are going to be Jacobian matrices, matrices containing the derivative of each element of, for example, z with respect to each element of x.

So to give an example of where this happens, let's say our input is now going to be a vector, say a 4096-dimensional input vector, which is a common size that you might see in convolutional neural networks later on, and our node is going to be an element-wise maximum.
766 00:38:37,918 --> 00:38:41,128 So we have f of x is equal to the maximum 767 00:38:41,128 --> 00:38:45,295 of x compared with zero element-wise, and then our output 768 00:38:46,875 --> 00:38:50,708 is going to be also a 4096-dimensional vector. 769 00:38:53,625 --> 00:38:55,351 Okay, so in this case, what's the size 770 00:38:55,351 --> 00:38:56,969 of our Jacobian matrix? 771 00:38:56,969 --> 00:38:59,281 Remember I said earlier, the Jacobian matrix 772 00:38:59,281 --> 00:39:01,903 is going to be 773 00:39:01,903 --> 00:39:03,628 a matrix of partial derivatives, 774 00:39:03,628 --> 00:39:06,229 the partial derivatives of each dimension of 775 00:39:06,229 --> 00:39:10,396 the output with respect to each dimension of the input. 776 00:39:11,705 --> 00:39:15,230 Okay, so the answer I heard was 4,096 squared, 777 00:39:15,230 --> 00:39:17,572 and that's, yeah, that's correct. 778 00:39:17,572 --> 00:39:21,405 So this is pretty large, right, 4,096 by 4,096, 779 00:39:22,261 --> 00:39:25,203 and in practice this is going to be even larger 780 00:39:25,203 --> 00:39:27,400 because we're going to work with minibatches 781 00:39:27,400 --> 00:39:30,692 of, for example, 100 inputs 782 00:39:30,692 --> 00:39:33,187 at the same time, right, and we'll put all of these 783 00:39:33,187 --> 00:39:36,205 through our node at the same time to be more efficient, 784 00:39:36,205 --> 00:39:38,739 and so this is going to scale this by 100, 785 00:39:38,739 --> 00:39:40,595 and in practice our Jacobian's actually going to turn out 786 00:39:40,595 --> 00:39:45,082 to be something like 409,600 by 409,600, right, 787 00:39:45,082 --> 00:39:48,499 so this is really huge, and basically 788 00:39:48,499 --> 00:39:50,750 completely impractical to work with. 789 00:39:50,750 --> 00:39:53,633 So in practice though, we don't actually need 790 00:39:53,633 --> 00:39:57,507 to compute this huge Jacobian most of the time, 791 00:39:57,507 --> 00:39:59,149 and so why is that, like, what does 792 00:39:59,149 --> 00:40:00,902 this Jacobian matrix look like? 793 00:40:00,902 --> 00:40:04,509 If we think about what's happening here, 794 00:40:04,509 --> 00:40:06,777 where we're taking this element-wise maximum, 795 00:40:06,777 --> 00:40:08,312 and we think about what each of 796 00:40:08,312 --> 00:40:10,899 the partial derivatives is, right, 797 00:40:10,899 --> 00:40:13,888 which dimensions of the input affect 798 00:40:13,888 --> 00:40:15,706 which dimensions of the output? 799 00:40:15,706 --> 00:40:19,686 What sort of structure can we see in our Jacobian matrix? 800 00:40:19,686 --> 00:40:21,198 [student speaking away from microphone] 801 00:40:21,198 --> 00:40:23,869 Okay, so I heard that it's diagonal, right, exactly. 802 00:40:23,869 --> 00:40:27,703 So because this is element-wise, right, each element of 803 00:40:27,703 --> 00:40:31,109 the input, say the first dimension, only affects 804 00:40:31,109 --> 00:40:34,741 that corresponding element in the output, right? 805 00:40:34,741 --> 00:40:38,014 And so because of that, our Jacobian matrix 806 00:40:38,014 --> 00:40:41,567 is just going to be a diagonal matrix.
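To make the diagonal-Jacobian point concrete, here is a minimal numpy sketch of this element-wise maximum node; because output element i depends only on input element i, the backward pass never materializes the 4,096 by 4,096 Jacobian, it just masks the upstream gradient (the variable names are illustrative):

    import numpy as np

    x = np.random.randn(4096)      # input vector to the node
    out = np.maximum(0, x)         # forward pass: element-wise max with zero

    dout = np.random.randn(4096)   # stand-in for the upstream gradient dL/dout
    # The Jacobian d(out)/dx is diagonal, with ones where x > 0 and zeros
    # elsewhere, so multiplying by it reduces to an element-wise mask.
    dx = dout * (x > 0)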
807 00:40:41,567 --> 00:40:43,924 And so in practice then, we don't actually have 808 00:40:43,924 --> 00:40:47,198 to write out and formulate this entire Jacobian, 809 00:40:47,198 --> 00:40:51,365 we can just know the effect of x on the output, right, 810 00:40:54,986 --> 00:40:57,883 and then we can just use these values, right, 811 00:40:57,883 --> 00:41:01,800 and fill them in as we're computing the gradient. 812 00:41:04,100 --> 00:41:06,716 Okay, so now we're going to go through 813 00:41:06,716 --> 00:41:10,693 a more concrete vectorized example of a computational graph. 814 00:41:10,693 --> 00:41:11,724 Right, so let's look at a case 815 00:41:11,724 --> 00:41:15,065 where we have the function f of x and W 816 00:41:15,065 --> 00:41:19,232 equal to, basically, the squared L-two norm of W multiplied by x, 817 00:41:22,407 --> 00:41:24,013 and so in this case we're going to say x 818 00:41:24,013 --> 00:41:26,763 is n-dimensional and W is n by n. 819 00:41:29,076 --> 00:41:30,563 Right, so again our first step, 820 00:41:30,563 --> 00:41:32,635 writing out the computational graph, right? 821 00:41:32,635 --> 00:41:35,303 We have W multiplied by x, and then followed by, 822 00:41:35,303 --> 00:41:37,886 I'm just going to call this L-two. 823 00:41:39,520 --> 00:41:42,822 And so now let's also fill out some values for this, 824 00:41:42,822 --> 00:41:45,382 so, you know, let's say we have W be 825 00:41:45,382 --> 00:41:47,560 this two by two matrix, and x 826 00:41:47,560 --> 00:41:50,632 is going to be this two-dimensional vector, right? 827 00:41:50,632 --> 00:41:53,949 And so we can again label our intermediate nodes. 828 00:41:53,949 --> 00:41:56,653 So our intermediate node after the multiplication 829 00:41:56,653 --> 00:42:00,230 is going to be q, we have q equals W times x, 830 00:42:00,230 --> 00:42:02,666 which we can write out element-wise this way, 831 00:42:02,666 --> 00:42:05,632 where the first element is just W-one-one times x-one 832 00:42:05,632 --> 00:42:08,904 plus W-one-two times x-two and so on, 833 00:42:08,904 --> 00:42:12,611 and then we can now express f in relation to q, right? 834 00:42:12,611 --> 00:42:15,508 So looking at the second node we have f of q 835 00:42:15,508 --> 00:42:18,175 equal to the squared L-two norm of q, 836 00:42:19,260 --> 00:42:24,218 which is equal to q-one squared plus q-two squared. 837 00:42:24,218 --> 00:42:25,923 Okay, so we filled this in, right, 838 00:42:25,923 --> 00:42:29,423 we get q and then we get our final output. 839 00:42:30,491 --> 00:42:33,218 Okay, so now let's do backprop through this, right? 840 00:42:33,218 --> 00:42:36,333 So again, this is always the first step: 841 00:42:36,333 --> 00:42:40,500 the gradient with respect to our output is just one. 842 00:42:44,084 --> 00:42:47,505 Okay, so now let's move back one node, 843 00:42:47,505 --> 00:42:50,902 so now we want to find the gradient with respect to q, 844 00:42:50,902 --> 00:42:55,493 right, our intermediate variable before the L-two.
845 00:42:55,493 --> 00:42:58,719 And so q is a two-dimensional vector, 846 00:42:58,719 --> 00:43:00,065 and what we want to do is we want 847 00:43:00,065 --> 00:43:04,232 to find how each element of q affects our final value of f, 848 00:43:07,918 --> 00:43:09,586 right, and so if we look at this expression 849 00:43:09,586 --> 00:43:11,955 that we've written out for f here at the bottom, 850 00:43:11,955 --> 00:43:14,705 we can see that the gradient of f 851 00:43:15,644 --> 00:43:19,293 with respect to a specific q-i, let's say q-one, 852 00:43:19,293 --> 00:43:23,136 is just going to be two times q-i, right? 853 00:43:23,136 --> 00:43:26,569 This is just taking this derivative here, 854 00:43:26,569 --> 00:43:30,293 and so we have this expression 855 00:43:30,293 --> 00:43:32,453 with respect to each element q-i, 856 00:43:32,453 --> 00:43:34,606 and we could also, you know, write this out 857 00:43:34,606 --> 00:43:36,944 in vector form if we want to, 858 00:43:36,944 --> 00:43:39,869 it's just going to be two times our vector q, right, 859 00:43:39,869 --> 00:43:41,719 if we want to write this out in vector form, 860 00:43:41,719 --> 00:43:45,698 and so what we get is that our gradient is 0.44 861 00:43:45,698 --> 00:43:48,226 and 0.52, this vector, right? 862 00:43:48,226 --> 00:43:50,656 And so you can see that it just took q 863 00:43:50,656 --> 00:43:52,004 and scaled it by two, right? 864 00:43:52,004 --> 00:43:55,192 Each element is just multiplied by two. 865 00:43:55,192 --> 00:43:58,772 So the gradient with respect to a vector is always going to be 866 00:43:58,772 --> 00:44:01,182 the same size as the original vector, 867 00:44:01,182 --> 00:44:05,015 and each element of this gradient 868 00:44:06,530 --> 00:44:09,132 tells us how much this particular element 869 00:44:09,132 --> 00:44:12,549 affects our final output of the function. 870 00:44:16,126 --> 00:44:18,920 Okay, so now let's move one step backwards, right, 871 00:44:18,920 --> 00:44:22,523 what's the gradient with respect to W? 872 00:44:22,523 --> 00:44:25,639 And so here again we want to use the same concept 873 00:44:25,639 --> 00:44:26,998 of applying the chain rule, right, 874 00:44:26,998 --> 00:44:29,129 so we want to compute our local gradient 875 00:44:29,129 --> 00:44:31,328 of q with respect to W, and so let's look 876 00:44:31,328 --> 00:44:34,683 at this again element-wise, and if we do that, 877 00:44:34,683 --> 00:44:37,884 let's see what's the effect of each q, right, 878 00:44:37,884 --> 00:44:41,636 each element of q, with respect to each element of W, 879 00:44:41,636 --> 00:44:43,106 and so this is going to be the Jacobian 880 00:44:43,106 --> 00:44:47,592 that we talked about earlier, and if we look at this 881 00:44:47,592 --> 00:44:49,509 multiplication, q is equal 882 00:44:49,509 --> 00:44:53,092 to W times x, right, what's the derivative, 883 00:44:56,236 --> 00:44:59,197 or the gradient, of the first element of q, 884 00:44:59,197 --> 00:45:03,962 so our first element up top, with respect to W-one-one? 885 00:45:03,962 --> 00:45:07,470 So q-one with respect to W-one-one? 886 00:45:07,470 --> 00:45:09,157 What's that value? 887 00:45:09,157 --> 00:45:10,851 X-one, exactly. 888 00:45:10,851 --> 00:45:14,068 Yeah, so we know that this is x-one, 889 00:45:14,068 --> 00:45:17,659 and we can write this out more generally: 890 00:45:17,659 --> 00:45:21,826 the gradient of q-k with respect to W-i,j is equal to an indicator, one if k equals i, times x-j.
891 00:45:24,313 --> 00:45:26,111 And then now we want to find the gradient 892 00:45:26,111 --> 00:45:30,278 of f with respect to each W-i,j. 893 00:45:31,523 --> 00:45:35,332 So looking at these derivatives now, 894 00:45:35,332 --> 00:45:38,339 we can use this chain rule that we talked about earlier, 895 00:45:38,339 --> 00:45:41,672 where we basically compound df over dq-k 896 00:45:43,770 --> 00:45:47,270 for each element of q with dq-k over dW-i,j 897 00:45:48,934 --> 00:45:51,498 for each element W-i,j, right? 898 00:45:51,498 --> 00:45:53,563 So we find the effect of each element of W 899 00:45:53,563 --> 00:45:57,649 on each element of q, and sum this across all q. 900 00:45:57,649 --> 00:45:59,653 And so if you write this out, this is going to give 901 00:45:59,653 --> 00:46:03,236 this expression of two times q-i times x-j. 902 00:46:05,898 --> 00:46:08,744 Okay, and so filling this out then we get 903 00:46:08,744 --> 00:46:11,564 this gradient with respect to W, 904 00:46:11,564 --> 00:46:14,270 and so again we can compute this element-wise, 905 00:46:14,270 --> 00:46:17,360 or we can also look at this expression that we've derived 906 00:46:17,360 --> 00:46:20,943 and write it out in vectorized form, right? 907 00:46:22,028 --> 00:46:24,827 So okay, and remember, the important thing 908 00:46:24,827 --> 00:46:28,484 is that the gradient with respect to a variable 909 00:46:28,484 --> 00:46:31,137 should always have the same shape as the variable, 910 00:46:31,137 --> 00:46:33,665 and so this is something really useful in practice 911 00:46:33,665 --> 00:46:36,108 to sanity check, right, like once you've computed 912 00:46:36,108 --> 00:46:38,042 what your gradient should be, check that this is 913 00:46:38,042 --> 00:46:40,709 the same shape as your variable, 914 00:46:42,710 --> 00:46:46,220 because again, each element of your gradient 915 00:46:46,220 --> 00:46:49,176 is quantifying how much that element 916 00:46:49,176 --> 00:46:53,343 is affecting your final output. 917 00:46:54,177 --> 00:46:55,010 Yeah? 918 00:46:55,010 --> 00:46:57,489 [student speaking away from microphone] 919 00:46:57,489 --> 00:46:59,023 Oh, the one-if-k-equals-i term, 920 00:46:59,023 --> 00:47:01,427 that's an indicator function, so this is saying 921 00:47:01,427 --> 00:47:04,177 that it's just one if k equals i. 922 00:47:08,276 --> 00:47:11,571 Okay, so let's see, so we've done that, 923 00:47:11,571 --> 00:47:14,738 and so now let's see one more example. 924 00:47:16,647 --> 00:47:18,094 The last thing we need to find is 925 00:47:18,094 --> 00:47:21,031 the gradient with respect to x. 926 00:47:21,031 --> 00:47:24,824 So here if we compute the partial derivatives we can see 927 00:47:24,824 --> 00:47:28,574 that dq-k over dx-i is equal to W-k,i, right, 928 00:47:29,535 --> 00:47:32,702 derived the same way as we did it for W, 929 00:47:34,833 --> 00:47:38,702 and then again we can just use the chain rule 930 00:47:38,702 --> 00:47:43,127 and get the total expression for that, right? 931 00:47:43,127 --> 00:47:44,902 And so this is going to be the gradient 932 00:47:44,902 --> 00:47:48,350 with respect to x, again, of the same shape as x, 933 00:47:48,350 --> 00:47:49,522 and we can also write this out 934 00:47:49,522 --> 00:47:52,022 in vectorized form if we want. 935 00:47:53,529 --> 00:47:56,442 Okay, so any questions about this, yeah?
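Putting the whole example together, here is a minimal numpy sketch of the forward and backward passes for f of x and W equal to the squared L-two norm of W times x. The particular values of W and x are assumed for illustration; they are consistent with the gradient two-times-q of 0.44 and 0.52 quoted above:

    import numpy as np

    W = np.array([[0.1, 0.5],
                  [-0.3, 0.8]])    # assumed example values
    x = np.array([0.2, 0.4])       # assumed example values

    # forward pass
    q = W.dot(x)                   # q = W x, here [0.22, 0.26]
    f = np.sum(q ** 2)             # f = q1^2 + q2^2

    # backward pass
    dq = 2.0 * q                   # df/dq_i = 2 q_i, here [0.44, 0.52]
    dW = np.outer(dq, x)           # df/dW_ij = 2 q_i x_j, same shape as W
    dx = W.T.dot(dq)               # df/dx_i = sum_k W_ki (2 q_k), same shape as x

Note that dW and dx come out with the same shapes as W and x, which is exactly the sanity check just mentioned.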
936 00:47:56,442 --> 00:47:59,100 [student speaking away from microphone] 937 00:47:59,100 --> 00:48:01,850 So we are computing the Jacobian, 938 00:48:03,194 --> 00:48:05,694 so let me go back here, right, 939 00:48:07,414 --> 00:48:10,774 so if we're doing, so right, so we have 940 00:48:10,774 --> 00:48:12,873 these partial derivatives of q-k 941 00:48:12,873 --> 00:48:16,439 with respect to x-i, right, and these 942 00:48:16,439 --> 00:48:20,714 are forming your, the entries of your Jacobian, right? 943 00:48:20,714 --> 00:48:22,879 And so in practice what we're going to do 944 00:48:22,879 --> 00:48:26,306 is we basically take that, and you're going to see it 945 00:48:26,306 --> 00:48:29,222 up there in the chain rule, so the vectorized expression 946 00:48:29,222 --> 00:48:31,259 of gradient with respect to x, right, 947 00:48:31,259 --> 00:48:33,293 this is going to have the Jacobian here 948 00:48:33,293 --> 00:48:36,332 which is this transposed value here, 949 00:48:36,332 --> 00:48:38,926 so you can write it out in vectorized form. 950 00:48:38,926 --> 00:48:43,489 [student speaking away from microphone] 951 00:48:43,489 --> 00:48:44,923 So well, so in this case the matrix 952 00:48:44,923 --> 00:48:47,636 is going to be the same size as W right, 953 00:48:47,636 --> 00:48:51,803 so it's not actually a large matrix in this case, right? 954 00:48:55,878 --> 00:48:58,941 Okay, so the way that we've been thinking about this 955 00:48:58,941 --> 00:49:01,764 is like a really modularized implementation, right, 956 00:49:01,764 --> 00:49:05,305 where in our computational graph, right, 957 00:49:05,305 --> 00:49:07,678 we look at each node locally and we compute 958 00:49:07,678 --> 00:49:09,046 the local gradients and chain them 959 00:49:09,046 --> 00:49:10,965 with upstream gradients coming down, 960 00:49:10,965 --> 00:49:13,181 and so you can think of this as basically 961 00:49:13,181 --> 00:49:15,337 a forward and a backwards API, right? 962 00:49:15,337 --> 00:49:19,186 In the forward pass we implement the, you know, 963 00:49:19,186 --> 00:49:22,396 a function computing the output of this node, 964 00:49:22,396 --> 00:49:25,011 and then in the backwards pass we compute the gradient. 965 00:49:25,011 --> 00:49:27,295 And so when we actually implement this in code, 966 00:49:27,295 --> 00:49:30,592 we're going to do this in exactly the same way. 967 00:49:30,592 --> 00:49:34,865 So we can basically think about, for each gate, right, 968 00:49:34,865 --> 00:49:38,464 if we implement a forward function and a backward function, 969 00:49:38,464 --> 00:49:40,908 where the backward function is computing the chain rule, 970 00:49:40,908 --> 00:49:45,572 then if we have our entire graph, we can just make 971 00:49:45,572 --> 00:49:48,833 a forward pass through the entire graph by iterating through 972 00:49:48,833 --> 00:49:51,059 all the nodes in the graph, all the gates. 973 00:49:51,059 --> 00:49:52,399 Here I'm going to use the word gate and node, 974 00:49:52,399 --> 00:49:54,768 kind of interchangeably, we can iterate through 975 00:49:54,768 --> 00:49:57,194 all of these gates and just call forward 976 00:49:57,194 --> 00:49:58,803 on each of the gates, right? 977 00:49:58,803 --> 00:50:01,349 And we just want to do this in topologically sorted order, 978 00:50:01,349 --> 00:50:03,357 so we process all of the inputs coming in to 979 00:50:03,357 --> 00:50:06,332 a node before we process that node. 
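As a sketch of that graph-level API (the class and attribute names here are hypothetical, not from any particular library): forward visits the gates in topologically sorted order, and backward, as described next, visits them in reverse:

    class ComputationalGraph(object):
        def __init__(self, gates):
            # gates is assumed already topologically sorted, so every
            # gate's inputs are processed before the gate itself
            self.gates = gates

        def forward(self):
            for gate in self.gates:        # forward pass through every gate
                gate.forward()
            return self.gates[-1].output   # e.g. the loss at the final gate

        def backward(self):
            for gate in reversed(self.gates):   # reverse topological order
                gate.backward()                 # each gate applies the chain rule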
980 00:50:06,332 --> 00:50:08,576 And then going backwards, we're just going to then 981 00:50:08,576 --> 00:50:11,565 go through all of the gates in this reverse sorted order, 982 00:50:11,565 --> 00:50:15,482 and then call backwards on each of these gates. 983 00:50:16,564 --> 00:50:19,147 Okay, and so if we look then at 984 00:50:20,225 --> 00:50:22,285 the implementation for our particular gates, 985 00:50:22,285 --> 00:50:24,596 so for example, this MultiplyGate here, 986 00:50:24,596 --> 00:50:26,960 we want to implement the forward pass, right, 987 00:50:26,960 --> 00:50:31,127 so it gets x and y as inputs, and returns the value of z, 988 00:50:32,520 --> 00:50:34,162 and then when we go backwards, right, 989 00:50:34,162 --> 00:50:37,216 we get as input dz, which is our upstream gradient, 990 00:50:37,216 --> 00:50:40,167 and we want to output the gradients 991 00:50:40,167 --> 00:50:42,523 on the inputs x and y to pass down, right? 992 00:50:42,523 --> 00:50:45,190 So we're going to output dx and dy, 993 00:50:46,426 --> 00:50:49,110 and so in this case, in this example, 994 00:50:49,110 --> 00:50:51,567 everything is back to the scalar case here, 995 00:50:51,567 --> 00:50:55,574 and so if we look at this in the forward pass, 996 00:50:55,574 --> 00:50:57,215 one thing that's important is that 997 00:50:57,215 --> 00:50:59,327 we should cache the values of the forward pass, right, 998 00:50:59,327 --> 00:51:01,043 because we end up using these in 999 00:51:01,043 --> 00:51:03,156 the backward pass a lot of the time. 1000 00:51:03,156 --> 00:51:06,061 So here in the forward pass, we want to cache the values 1001 00:51:06,061 --> 00:51:10,594 of x and y, right, and in the backward pass, 1002 00:51:10,594 --> 00:51:12,930 using the chain rule, we're going to, remember, 1003 00:51:12,930 --> 00:51:14,348 take the value of the upstream gradient 1004 00:51:14,348 --> 00:51:17,345 and scale it by the value of the other branch, right, 1005 00:51:17,345 --> 00:51:20,294 and so for dx we'll take the value 1006 00:51:20,294 --> 00:51:22,960 of self.y that we kept, and multiply it 1007 00:51:22,960 --> 00:51:25,877 by dz coming down, and the same for dy. 1008 00:51:28,799 --> 00:51:32,879 Okay, so if you look at a lot of deep learning frameworks 1009 00:51:32,879 --> 00:51:35,358 and libraries you'll see that they exactly follow 1010 00:51:35,358 --> 00:51:37,873 this kind of modularization, right? 1011 00:51:37,873 --> 00:51:42,040 So for example, Caffe is a popular deep learning framework, 1012 00:51:43,284 --> 00:51:46,047 and you'll see, if you go look through the Caffe source code, 1013 00:51:46,047 --> 00:51:48,351 you'll get to some directory that says layers, 1014 00:51:48,351 --> 00:51:52,443 and these layers are basically computational nodes, 1015 00:51:52,443 --> 00:51:54,400 usually slightly more 1016 00:51:54,400 --> 00:51:56,217 complex computational nodes, 1017 00:51:56,217 --> 00:51:58,671 like the sigmoid that we talked about earlier, 1018 00:51:58,671 --> 00:52:01,555 and you'll see basically just a whole list 1019 00:52:01,555 --> 00:52:04,342 of all different kinds of computational nodes, right?
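Before looking at those layers, here is a minimal sketch of the MultiplyGate just described, in the scalar case: the forward pass caches its inputs, and the backward pass scales the upstream gradient dz by the value of the other branch:

    class MultiplyGate(object):
        def forward(self, x, y):
            z = x * y
            self.x = x          # cache the inputs; the backward pass needs them
            self.y = y
            return z

        def backward(self, dz):
            # dz is the upstream gradient dL/dz; by the chain rule, each
            # input's gradient is dz scaled by the other branch's value
            dx = self.y * dz
            dy = self.x * dz
            return [dx, dy]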
1020 00:52:04,342 --> 00:52:06,132 So you might have the sigmoid, and, let's see, 1021 00:52:06,132 --> 00:52:09,205 there's a convolution layer, 1022 00:52:09,205 --> 00:52:11,925 there's an Argmax layer, 1023 00:52:11,925 --> 00:52:14,055 you'll have all of these layers, and if you dig in 1024 00:52:14,055 --> 00:52:16,340 to each of them, they're just exactly implementing 1025 00:52:16,340 --> 00:52:18,355 a forward pass and a backward pass, 1026 00:52:18,355 --> 00:52:20,929 and then all of these are called 1027 00:52:20,929 --> 00:52:23,072 when we do a forward and backward pass 1028 00:52:23,072 --> 00:52:25,130 through the entire network that we've formed, 1029 00:52:25,130 --> 00:52:27,372 and so our network is just basically going 1030 00:52:27,372 --> 00:52:30,393 to be stacking up all of these, 1031 00:52:30,393 --> 00:52:34,467 the different layers that we choose to use in the network. 1032 00:52:34,467 --> 00:52:37,560 So for example, if we look at a specific one, 1033 00:52:37,560 --> 00:52:40,613 in this case a sigmoid layer, 1034 00:52:40,613 --> 00:52:42,281 right, 1035 00:52:42,281 --> 00:52:44,658 we've talked about the sigmoid function, 1036 00:52:44,658 --> 00:52:47,140 and you'll see that there's a forward pass 1037 00:52:47,140 --> 00:52:51,443 which basically computes exactly the sigmoid expression, 1038 00:52:51,443 --> 00:52:53,084 and then a backward pass, right, 1039 00:52:53,084 --> 00:52:57,251 where it takes as input something called top_diff, 1040 00:53:00,031 --> 00:53:02,158 which is our upstream gradient in this case, 1041 00:53:02,158 --> 00:53:06,325 and multiplies it by a local gradient that we compute. 1042 00:53:08,267 --> 00:53:10,539 So in assignment one you'll get practice 1043 00:53:10,539 --> 00:53:15,277 with this computational graph way of thinking, 1044 00:53:15,277 --> 00:53:16,829 where, you know, you're going to be writing 1045 00:53:16,829 --> 00:53:19,279 your SVM and Softmax classifiers, 1046 00:53:19,279 --> 00:53:21,041 and taking the gradients of these. 1047 00:53:21,041 --> 00:53:24,005 And so again, remember, the first step is always 1048 00:53:24,005 --> 00:53:27,860 to represent it as a computational graph, right? 1049 00:53:27,860 --> 00:53:29,492 Figure out what are all the computations 1050 00:53:29,492 --> 00:53:31,632 that you did leading up to the output, 1051 00:53:31,632 --> 00:53:32,928 and then when it's time 1052 00:53:32,928 --> 00:53:35,786 to do your backward pass, just take the gradient 1053 00:53:35,786 --> 00:53:38,430 with respect to each of these intermediate variables 1054 00:53:38,430 --> 00:53:42,262 that you've defined in your computational graph, 1055 00:53:42,262 --> 00:53:46,345 and use the chain rule to link them all together. 1056 00:53:47,630 --> 00:53:50,726 Okay, so a summary of what we've talked about so far. 1057 00:53:50,726 --> 00:53:53,037 When we get down to working with neural networks, 1058 00:53:53,037 --> 00:53:55,112 these are going to be really large and complex, 1059 00:53:55,112 --> 00:53:56,859 so it's going to be impractical to write down 1060 00:53:56,859 --> 00:54:01,175 the gradient formula by hand for all your parameters.
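Returning briefly to that sigmoid layer: a minimal sketch of what it boils down to, in numpy rather than Caffe's C++ (the class name is illustrative). The forward pass computes sigma(x) = 1 / (1 + e^(-x)) and caches it, and the backward pass multiplies the upstream gradient top_diff by the local gradient sigma(x) times (1 - sigma(x)):

    import numpy as np

    class SigmoidLayer(object):
        def forward(self, x):
            self.out = 1.0 / (1.0 + np.exp(-x))   # sigmoid, cached for backward
            return self.out

        def backward(self, top_diff):
            # chain rule: upstream gradient times the sigmoid's local gradient
            return top_diff * self.out * (1.0 - self.out)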
1061 00:54:01,175 --> 00:54:04,864 So in order to get these gradients, right, 1062 00:54:04,864 --> 00:54:08,866 we talked about how what we should use is backpropagation, 1063 00:54:08,866 --> 00:54:12,226 right, and this is kind of one of the core techniques 1064 00:54:12,226 --> 00:54:15,809 of neural networks, basically 1065 00:54:16,660 --> 00:54:19,328 using backpropagation to get your gradients, right? 1066 00:54:19,328 --> 00:54:21,032 And so this is a recursive application of 1067 00:54:21,032 --> 00:54:23,942 the chain rule, where we have this computational graph, 1068 00:54:23,942 --> 00:54:26,298 and we start at the back and we go backwards through it 1069 00:54:26,298 --> 00:54:27,616 to compute the gradients with respect 1070 00:54:27,616 --> 00:54:29,984 to all of the intermediate variables, 1071 00:54:29,984 --> 00:54:31,806 which are your inputs, your parameters, 1072 00:54:31,806 --> 00:54:35,369 and everything else in the middle. 1073 00:54:35,369 --> 00:54:37,344 And we've also talked about how, really, in 1074 00:54:37,344 --> 00:54:40,642 this implementation and this graph structure, 1075 00:54:40,642 --> 00:54:42,746 each of these nodes, you can see this 1076 00:54:42,746 --> 00:54:45,671 as implementing a forward and backward API, right? 1077 00:54:45,671 --> 00:54:47,976 And so in the forward pass we want to compute 1078 00:54:47,976 --> 00:54:49,881 the result of the operation, and we want 1079 00:54:49,881 --> 00:54:51,937 to save any intermediate values 1080 00:54:51,937 --> 00:54:55,149 that we might want to use later in our gradient computation, 1081 00:54:55,149 --> 00:54:58,676 and then in the backward pass we apply this chain rule, 1082 00:54:58,676 --> 00:55:00,200 and we take this upstream gradient 1083 00:55:00,200 --> 00:55:03,101 and multiply it with our local gradient 1084 00:55:03,101 --> 00:55:05,205 to compute the gradient with respect to 1085 00:55:05,205 --> 00:55:08,240 the inputs of the node, and we pass this down 1086 00:55:08,240 --> 00:55:11,323 to the nodes that are connected next. 1087 00:55:12,906 --> 00:55:15,263 Okay, so now finally we're going 1088 00:55:15,263 --> 00:55:17,600 to talk about neural networks. 1089 00:55:17,600 --> 00:55:21,600 All right, so really, you know, 1090 00:55:24,254 --> 00:55:26,429 people draw a lot of analogies between neural networks 1091 00:55:26,429 --> 00:55:27,794 and the brain, and different types 1092 00:55:27,794 --> 00:55:30,225 of biological inspirations, and we'll get to that in 1093 00:55:30,225 --> 00:55:33,266 a little bit, but first let's talk about it, you know, 1094 00:55:33,266 --> 00:55:34,670 just looking at it as a function, 1095 00:55:34,670 --> 00:55:39,385 as a class of functions, without all of the brain stuff. 1096 00:55:39,385 --> 00:55:41,227 So, so far, you know, 1097 00:55:41,227 --> 00:55:43,748 we've worked a lot with this linear score function, right? 1098 00:55:43,748 --> 00:55:46,573 f equals W times x, and so we've been using this 1099 00:55:46,573 --> 00:55:50,740 as a running example of a function that we want to optimize. 1100 00:55:51,716 --> 00:55:55,195 So instead of using this single linear transformation, 1101 00:55:55,195 --> 00:55:57,138 if we want a neural network, we 1102 00:55:57,138 --> 00:55:59,066 can, in the simplest form, 1103 00:55:59,066 --> 00:56:01,250 just stack two of these together, right?
1104 00:56:01,250 --> 00:56:04,630 Just a linear transformation on top of another one 1105 00:56:04,630 --> 00:56:08,122 in order to get a two-layer neural network, right? 1106 00:56:08,122 --> 00:56:11,451 And so what this looks like is: first we have, you know, 1107 00:56:11,451 --> 00:56:14,284 a matrix multiply of W-one with x, 1108 00:56:15,921 --> 00:56:19,342 and then we get this intermediate variable, 1109 00:56:19,342 --> 00:56:23,509 and we have this non-linear function, a max of zero 1110 00:56:24,761 --> 00:56:28,706 with the output of this linear layer, 1111 00:56:28,706 --> 00:56:32,093 and it's really important to have these non-linearities 1112 00:56:32,093 --> 00:56:34,050 in place, which we'll talk about more later, 1113 00:56:34,050 --> 00:56:36,480 because otherwise if you just stack linear layers on top 1114 00:56:36,480 --> 00:56:38,203 of each other, they're just going to collapse 1115 00:56:38,203 --> 00:56:40,976 to a single linear function. 1116 00:56:40,976 --> 00:56:43,145 Okay, so we have our first linear layer 1117 00:56:43,145 --> 00:56:45,846 and then we have this non-linearity, right, 1118 00:56:45,846 --> 00:56:49,347 and then on top of this we'll add another linear layer. 1119 00:56:49,347 --> 00:56:52,310 And then from here, finally, we can get our score function, 1120 00:56:52,310 --> 00:56:54,962 our output vector of scores. 1121 00:56:54,962 --> 00:56:57,562 So basically, more broadly speaking, 1122 00:56:57,562 --> 00:57:00,189 neural networks are a class of functions 1123 00:57:00,189 --> 00:57:02,652 where we have simpler functions, right, 1124 00:57:02,652 --> 00:57:04,278 that are stacked on top of each other, 1125 00:57:04,278 --> 00:57:06,261 and we stack them in a hierarchical way 1126 00:57:06,261 --> 00:57:10,078 in order to make up a more complex non-linear function, 1127 00:57:10,078 --> 00:57:13,827 and so this is the idea of having, basically, multiple stages 1128 00:57:13,827 --> 00:57:16,702 of hierarchical computation, right? 1129 00:57:16,702 --> 00:57:19,455 And so, you know, the main way 1130 00:57:19,455 --> 00:57:22,589 that we do this is by taking 1131 00:57:22,589 --> 00:57:26,220 something like this matrix multiply, this linear layer, 1132 00:57:26,220 --> 00:57:30,204 and we just stack multiple of these on top of each other 1133 00:57:30,204 --> 00:57:33,871 with non-linear functions in-between, right? 1134 00:57:38,446 --> 00:57:41,185 And so one thing that this can help solve: if we 1135 00:57:41,185 --> 00:57:43,135 remember back to this linear score function 1136 00:57:43,135 --> 00:57:45,170 that we were talking about, right, 1137 00:57:45,170 --> 00:57:49,390 remember we discussed earlier how each row 1138 00:57:49,390 --> 00:57:53,283 of our weight matrix W was something like a template. 1139 00:57:53,283 --> 00:57:55,537 It was a template that sort of expressed, you know, 1140 00:57:55,537 --> 00:57:58,697 what we're looking for in the input 1141 00:57:58,697 --> 00:58:01,049 for a specific class, right, so for example, you know, 1142 00:58:01,049 --> 00:58:02,394 the car template looks something like 1143 00:58:02,394 --> 00:58:06,110 this kind of fuzzy red car, and we were looking for this 1144 00:58:06,110 --> 00:58:10,151 in the input to compute the score for the car class. 1145 00:58:10,151 --> 00:58:11,588 And we talked about how one of the problems with this 1146 00:58:11,588 --> 00:58:13,566 is that there's only one template, right?
1147 00:58:13,566 --> 00:58:15,848 There's this red car, whereas in practice 1148 00:58:15,848 --> 00:58:17,129 we actually have multiple modes, right? 1149 00:58:17,129 --> 00:58:19,537 We might be looking for, you know, a red car, 1150 00:58:19,537 --> 00:58:21,225 but there's also a yellow car, like all of these 1151 00:58:21,225 --> 00:58:24,293 are different kinds of cars, and so what 1152 00:58:24,293 --> 00:58:28,210 this kind of multiple-layer network lets you do 1153 00:58:29,287 --> 00:58:33,666 is, now, with this intermediate variable h, 1154 00:58:33,666 --> 00:58:36,785 right, W-one can still be these kinds of templates, 1155 00:58:36,785 --> 00:58:38,943 but now you have all of these scores 1156 00:58:38,943 --> 00:58:42,006 for these templates in h, and we can have another layer 1157 00:58:42,006 --> 00:58:45,503 on top that's combining these together, right? 1158 00:58:45,503 --> 00:58:48,693 So we can say that actually my car class should be 1159 00:58:48,693 --> 00:58:50,525 connected to, we're looking 1160 00:58:50,525 --> 00:58:53,602 for both red cars as well as yellow cars, right, 1161 00:58:53,602 --> 00:58:56,652 because we have this matrix W-two 1162 00:58:56,652 --> 00:59:00,819 which is now a weighting of all the elements of our vector h. 1163 00:59:05,691 --> 00:59:08,478 Okay, any questions about this? 1164 00:59:08,478 --> 00:59:09,311 Yeah? 1165 00:59:09,311 --> 00:59:13,795 [student speaking away from microphone] 1166 00:59:13,795 --> 00:59:15,529 Yeah, so there's a lot of 1167 00:59:15,529 --> 00:59:17,189 different non-linear functions 1168 00:59:17,189 --> 00:59:20,821 that you can choose from, and we'll talk later on, 1169 00:59:20,821 --> 00:59:22,537 in a later lecture, about all the different kinds 1170 00:59:22,537 --> 00:59:25,880 of non-linearities that you might want to use. 1171 00:59:25,880 --> 00:59:28,599 - [Student] For the pictures in the slide, 1172 00:59:28,599 --> 00:59:31,411 so, on the bottom row you have images 1173 00:59:31,411 --> 00:59:34,828 of your weight vectors in W-one, and so maybe 1174 00:59:38,703 --> 00:59:42,536 you would have images of another vector W-two? 1175 00:59:46,014 --> 00:59:49,534 - So W-one, because it's directly connected to 1176 00:59:49,534 --> 00:59:52,965 the input x, this is what's, like, really interpretable, 1177 00:59:52,965 --> 00:59:56,909 because you can visualize all of these templates. 1178 00:59:56,909 --> 00:59:59,492 W-two, so h is going to be a score 1179 01:00:00,389 --> 01:00:03,570 of how much of each template you saw, for example, 1180 01:00:03,570 --> 01:00:06,047 all right, so it might be like you have a, you know, 1181 01:00:06,047 --> 01:00:09,640 like a, I don't know, two for the red car, 1182 01:00:09,640 --> 01:00:11,865 and, like, one for the yellow car or something like that. 1183 01:00:11,865 --> 01:00:16,678 - [Student] Oh, okay, so instead of W-one being just 10, 1184 01:00:16,678 --> 01:00:19,327 like, you would have a left-facing horse 1185 01:00:19,327 --> 01:00:21,807 and a right-facing horse, and they'd both be included-- 1186 01:00:21,807 --> 01:00:25,864 - Exactly, so the question is basically whether 1187 01:00:25,864 --> 01:00:27,849 in W-one you could have both a left-facing horse 1188 01:00:27,849 --> 01:00:30,532 and a right-facing horse, right, and so yeah, exactly. 1189 01:00:30,532 --> 01:00:34,590 So now W-one can be many different kinds of templates, right?
1190 01:00:34,590 --> 01:00:38,505 And then with W-two, we can basically take 1191 01:00:38,505 --> 01:00:41,304 a weighted sum of all of these templates. 1192 01:00:41,304 --> 01:00:43,461 So now it allows you to weight together multiple templates 1193 01:00:43,461 --> 01:00:47,014 in order to get the final score for a particular class. 1194 01:00:47,014 --> 01:00:48,416 - [Student] So if you're processing an image 1195 01:00:48,416 --> 01:00:50,713 that's actually a left-facing horse, 1196 01:00:50,713 --> 01:00:53,522 it'll get a really high score with 1197 01:00:53,522 --> 01:00:55,297 the left-facing horse template, 1198 01:00:55,297 --> 01:00:58,186 and a lower score with the right-facing horse template, 1199 01:00:58,186 --> 01:01:02,552 and then this will take the maximum of the two? 1200 01:01:02,552 --> 01:01:04,272 - Right, so okay, so the question is, 1201 01:01:04,272 --> 01:01:08,687 if our image x is, like, a left-facing horse, 1202 01:01:08,687 --> 01:01:10,519 and in W-one we have a template of 1203 01:01:10,519 --> 01:01:13,140 a left-facing horse and a right-facing horse, 1204 01:01:13,140 --> 01:01:14,435 then what's happening, right? 1205 01:01:14,435 --> 01:01:17,390 So what happens is, yeah, in h you might have 1206 01:01:17,390 --> 01:01:21,199 a really high score for your left-facing horse, 1207 01:01:21,199 --> 01:01:24,627 kind of a lower score for your right-facing horse, 1208 01:01:24,627 --> 01:01:27,976 and W-two is a weighted sum, so it's not a maximum. 1209 01:01:27,976 --> 01:01:30,749 It's a weighted sum of these templates, 1210 01:01:30,749 --> 01:01:33,215 but if you have either a really high score 1211 01:01:33,215 --> 01:01:35,597 for one of these templates, or let's say you have kind of 1212 01:01:35,597 --> 01:01:38,413 a lower and medium score for both of these templates, 1213 01:01:38,413 --> 01:01:39,968 all of these kinds of combinations 1214 01:01:39,968 --> 01:01:41,971 are going to give high scores, right? 1215 01:01:41,971 --> 01:01:43,305 And so in the end what you're going to get 1216 01:01:43,305 --> 01:01:45,153 is something that generally scores high 1217 01:01:45,153 --> 01:01:47,986 when you have a horse of any kind. 1218 01:01:49,195 --> 01:01:50,966 So let's say you had a front-facing horse, 1219 01:01:50,966 --> 01:01:53,470 you might have medium values for both 1220 01:01:53,470 --> 01:01:56,901 the left and the right templates. 1221 01:01:56,901 --> 01:01:57,963 Yeah, question? 1222 01:01:57,963 --> 01:01:59,994 - [Student] So is W-two doing the weighting, 1223 01:01:59,994 --> 01:02:01,705 or is h doing the weighting? 1224 01:02:01,705 --> 01:02:03,506 - W-two is doing the weighting, so the question is, 1225 01:02:03,506 --> 01:02:06,397 "Is W-two doing the weighting or is h doing the weighting?" 1226 01:02:06,397 --> 01:02:09,710 h is the value, like in this example, 1227 01:02:09,710 --> 01:02:14,144 h is the vector of scores for each of your templates 1228 01:02:14,144 --> 01:02:17,277 that you have in W-one, right? 1229 01:02:17,277 --> 01:02:21,128 So h is like the score function, right, 1230 01:02:21,128 --> 01:02:25,851 it's how much of each template in W-one is present, 1231 01:02:25,851 --> 01:02:29,627 and then W-two is going to weight 1232 01:02:29,627 --> 01:02:31,341 all of these intermediate scores 1233 01:02:31,341 --> 01:02:33,974 to get your final score for the class. 1234 01:02:33,974 --> 01:02:36,194 - [Student] And which is the non-linear thing?
1235 01:02:36,194 --> 01:02:38,004 - So the question is, "Which is the non-linear thing?" 1236 01:02:38,004 --> 01:02:42,171 So the non-linearity usually happens right before h, 1237 01:02:43,643 --> 01:02:47,896 so h is the value right after the non-linearity. 1238 01:02:47,896 --> 01:02:49,490 So we're talking about this, like, you know, 1239 01:02:49,490 --> 01:02:51,566 intuitively with this example where, 1240 01:02:51,566 --> 01:02:53,928 like, W-one 1241 01:02:53,928 --> 01:02:55,363 has these same templates as before, 1242 01:02:55,363 --> 01:02:57,152 and W-two is a weighting of these. 1243 01:02:57,152 --> 01:02:59,314 In practice it's not exactly like this, right, 1244 01:02:59,314 --> 01:03:01,255 because, as you said, there are all 1245 01:03:01,255 --> 01:03:03,358 these non-linearities thrown in and so on, 1246 01:03:03,358 --> 01:03:07,525 but it has this approximate type of interpretation to it. 1247 01:03:08,435 --> 01:03:11,602 - [Student] So h is just W-one-x then? 1248 01:03:13,811 --> 01:03:16,715 - Yeah, so the question is, is h just W-one times x? 1249 01:03:16,715 --> 01:03:20,882 So h is just W-one times x, with the max function on top. 1250 01:03:27,004 --> 01:03:29,087 Oh, let me just, okay, so 1251 01:03:30,259 --> 01:03:31,656 we've talked about this as an example of 1252 01:03:31,656 --> 01:03:34,390 a two-layer neural network, and we can stack more layers 1253 01:03:34,390 --> 01:03:37,221 of these to get deeper networks of arbitrary depth, right? 1254 01:03:37,221 --> 01:03:39,202 So we can just do this one more time, 1255 01:03:39,202 --> 01:03:43,369 add another non-linearity and matrix multiply, now by W-three, 1256 01:03:44,943 --> 01:03:47,773 and now we have a three-layer neural network, right? 1257 01:03:47,773 --> 01:03:49,921 And so this is where the term deep neural networks 1258 01:03:49,921 --> 01:03:51,604 is basically coming from, right? 1259 01:03:51,604 --> 01:03:55,081 This idea that you can stack multiple of these layers, 1260 01:03:55,081 --> 01:03:57,831 you know, for very deep networks. 1261 01:04:00,153 --> 01:04:03,486 And so in the homework you'll get practice 1262 01:04:04,453 --> 01:04:07,623 writing and, you know, training one of 1263 01:04:07,623 --> 01:04:10,363 these neural networks, I think in assignment two, 1264 01:04:10,363 --> 01:04:13,534 but basically a full implementation of this using 1265 01:04:13,534 --> 01:04:15,533 this idea of forward passes, right, 1266 01:04:15,533 --> 01:04:18,078 and backward passes, and using the chain rule 1267 01:04:18,078 --> 01:04:20,497 to compute gradients that we've already seen. 1268 01:04:20,497 --> 01:04:22,924 The entire implementation of a two-layer neural network 1269 01:04:22,924 --> 01:04:27,444 is actually really simple, it can just be done in 20 lines, 1270 01:04:27,444 --> 01:04:31,131 and so you'll get some practice with this in assignment two, 1271 01:04:31,131 --> 01:04:33,761 writing out all of these parts.
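As a rough illustration of how short such an implementation can be, here is a toy two-layer network with sigmoid non-linearities trained on random data by exactly the forward, backward, and update loop from this lecture. Everything here, the sizes, the data, the learning rate, is made up for the sketch and is not the assignment's actual setup:

    import numpy as np

    # toy data and weights (illustrative sizes)
    x, y = np.random.randn(64, 1000), np.random.randn(64, 10)
    w1, w2 = np.random.randn(1000, 100), np.random.randn(100, 10)

    for t in range(2000):
        # forward pass
        h = 1.0 / (1.0 + np.exp(-x.dot(w1)))     # hidden layer with sigmoid
        y_pred = h.dot(w2)                       # output scores
        loss = np.square(y_pred - y).sum()       # simple L2 loss

        # backward pass: chain rule through each stage
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = h.T.dot(grad_y_pred)
        grad_h = grad_y_pred.dot(w2.T)
        grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid local gradient h(1-h)

        # gradient descent update
        w1 -= 1e-4 * grad_w1
        w2 -= 1e-4 * grad_w2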
1272 01:04:33,761 --> 01:04:36,269 And okay, so now that we've sort of seen 1273 01:04:36,269 --> 01:04:38,182 what neural networks are as a function, right, 1274 01:04:38,182 --> 01:04:41,005 like, you know, we hear people talking a lot about 1275 01:04:41,005 --> 01:04:44,613 how there are biological inspirations for neural networks, 1276 01:04:44,613 --> 01:04:47,816 and so, even though it's important to emphasize 1277 01:04:47,816 --> 01:04:50,926 that these analogies are really loose, 1278 01:04:50,926 --> 01:04:53,875 really just very loose ties, 1279 01:04:53,875 --> 01:04:56,303 it's still interesting to understand 1280 01:04:56,303 --> 01:04:59,149 where some of these connections and inspirations come from. 1281 01:04:59,149 --> 01:05:03,445 And so now I'm going to talk briefly about that. 1282 01:05:03,445 --> 01:05:05,712 So if we think about a neuron in kind of 1283 01:05:05,712 --> 01:05:08,545 a very simple way, 1284 01:05:09,396 --> 01:05:11,270 here's a diagram of a neuron. 1285 01:05:11,270 --> 01:05:14,037 We have the impulses that are carried towards 1286 01:05:14,037 --> 01:05:15,144 each neuron, right, so we have a lot 1287 01:05:15,144 --> 01:05:19,503 of neurons connected together, and each neuron has dendrites, 1288 01:05:19,503 --> 01:05:22,008 right, and these are what receives 1289 01:05:22,008 --> 01:05:25,174 the impulses that come into the neuron. 1290 01:05:25,174 --> 01:05:26,706 And then we have a cell body, right, 1291 01:05:26,706 --> 01:05:30,362 that basically integrates these signals coming in, 1292 01:05:30,362 --> 01:05:34,018 and then, 1293 01:05:34,018 --> 01:05:36,927 after integrating all these signals, 1294 01:05:36,927 --> 01:05:38,786 the impulse is carried away from 1295 01:05:38,786 --> 01:05:42,543 the cell body to downstream neurons that it's connected to, 1296 01:05:42,543 --> 01:05:46,376 right, and it carries this away through axons. 1297 01:05:48,337 --> 01:05:51,580 So now if we look at what we've been doing so far, right, 1298 01:05:51,580 --> 01:05:54,401 with each computational node, you can see 1299 01:05:54,401 --> 01:05:57,397 that this actually works 1300 01:05:57,397 --> 01:05:59,592 in kind of a similar way, right? 1301 01:05:59,592 --> 01:06:01,694 Where nodes are connected to each other in 1302 01:06:01,694 --> 01:06:05,688 the computational graph, and we have inputs, 1303 01:06:05,688 --> 01:06:09,983 or signals, x, right, coming into a neuron, 1304 01:06:09,983 --> 01:06:13,732 and then all of these x, x-zero, x-one, x-two, 1305 01:06:13,732 --> 01:06:16,712 are combined and integrated together, right, 1306 01:06:16,712 --> 01:06:19,812 using, for example, our weights W. 1307 01:06:19,812 --> 01:06:22,821 So we do some sort of computation, right, 1308 01:06:22,821 --> 01:06:25,109 and in some of the computations we've been doing so far, 1309 01:06:25,109 --> 01:06:29,381 something like W times x plus b, right, 1310 01:06:29,381 --> 01:06:31,610 integrating all these together, 1311 01:06:31,610 --> 01:06:33,432 and then we have an activation function 1312 01:06:33,432 --> 01:06:36,931 that we apply on top, we get the value of this output, 1313 01:06:36,931 --> 01:06:40,764 and we pass it down to the connecting neurons. 1314 01:06:41,733 --> 01:06:44,068 So if you look at this, 1315 01:06:44,068 --> 01:06:45,898 you can think about this in a very similar way, right?
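In code, that analogy might look something like the following sketch, assuming a sigmoid activation; the class and method names here are illustrative:

    import numpy as np

    class Neuron(object):
        def __init__(self, weights, bias):
            self.weights, self.bias = weights, bias

        def neuron_tick(self, inputs):
            # integrate the incoming signals: w . x + b
            cell_body_sum = np.sum(inputs * self.weights) + self.bias
            # the activation (sigmoid here) plays the role of a firing rate
            firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))
            return firing_rate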
1316 01:06:45,898 --> 01:06:49,235 Like, you know, the signals coming in 1317 01:06:49,235 --> 01:06:53,404 are kind of connected at synapses, right? 1318 01:06:53,404 --> 01:06:56,289 The synapses connecting the multiple neurons, 1319 01:06:56,289 --> 01:06:59,745 the dendrites are integrating all of these, 1320 01:06:59,745 --> 01:07:02,727 they're integrating all of this information together in 1321 01:07:02,727 --> 01:07:04,226 the cell body, and then we have 1322 01:07:04,226 --> 01:07:08,489 the output carried away later on. 1323 01:07:08,489 --> 01:07:10,261 And so this is kind of the analogy 1324 01:07:10,261 --> 01:07:11,958 that you can draw between them, 1325 01:07:11,958 --> 01:07:15,240 and then look at these activation functions, right? 1326 01:07:15,240 --> 01:07:18,710 This is what basically takes all the inputs coming in 1327 01:07:18,710 --> 01:07:21,818 and outputs one number that's going out later on, 1328 01:07:21,818 --> 01:07:23,131 and we've talked about examples 1329 01:07:23,131 --> 01:07:26,223 like the sigmoid activation function, right, 1330 01:07:26,223 --> 01:07:28,294 and different kinds of non-linearities, 1331 01:07:28,294 --> 01:07:32,461 and so sort of one kind of loose analogy that you can draw 1332 01:07:34,601 --> 01:07:37,340 is that these non-linearities can represent 1333 01:07:37,340 --> 01:07:39,831 something sort of like the firing, 1334 01:07:39,831 --> 01:07:42,368 or spiking, rate of the neurons, right? 1335 01:07:42,368 --> 01:07:46,771 Where our neurons transmit signals to connecting neurons 1336 01:07:46,771 --> 01:07:49,174 using kind of these discrete spikes, right? 1337 01:07:49,174 --> 01:07:51,215 And so we can think of it as, you know, 1338 01:07:51,215 --> 01:07:53,909 if they're spiking very fast then there's kind of 1339 01:07:53,909 --> 01:07:56,664 a strong signal that's passed later on, 1340 01:07:56,664 --> 01:07:58,266 and so we can think of this value 1341 01:07:58,266 --> 01:08:02,035 after our activation function as sort of, in a sense, 1342 01:08:02,035 --> 01:08:05,335 this firing rate that we're going to pass on. 1343 01:08:05,335 --> 01:08:08,084 And, you know, in practice, I think neuroscientists 1344 01:08:08,084 --> 01:08:09,916 who are actually studying this say 1345 01:08:09,916 --> 01:08:11,614 that the non-linearity 1346 01:08:11,614 --> 01:08:14,493 that is most similar to the way 1347 01:08:14,493 --> 01:08:16,247 that neurons are actually behaving 1348 01:08:16,247 --> 01:08:19,560 is the ReLU non-linearity, 1349 01:08:19,560 --> 01:08:20,535 which is something that we're going 1350 01:08:20,536 --> 01:08:23,423 to look at more later on, but it's a function 1351 01:08:23,423 --> 01:08:28,160 that's at zero for all negative values of input, 1352 01:08:28,160 --> 01:08:31,716 and then it's a linear function for everything 1353 01:08:31,716 --> 01:08:34,072 that's in kind of a positive regime. 1354 01:08:34,072 --> 01:08:35,362 And so, you know, we'll talk more about 1355 01:08:35,362 --> 01:08:36,727 this activation function later on, 1356 01:08:36,727 --> 01:08:39,176 but that's kind of, in practice, 1357 01:08:39,176 --> 01:08:41,171 maybe the one that's most similar to 1358 01:08:41,171 --> 01:08:44,004 how neurons are actually behaving.
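For reference, that ReLU non-linearity is just a one-liner, zero in the negative regime and linear in the positive one:

    import numpy as np

    def relu(x):
        # element-wise max with zero
        return np.maximum(0, x)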
1359 01:08:46,020 --> 01:08:49,046 But it's really important to be extremely careful 1360 01:08:49,046 --> 01:08:52,056 with making any of these sorts of brain analogies, 1361 01:08:52,057 --> 01:08:53,978 because in practice biological neurons 1362 01:08:53,978 --> 01:08:56,388 are way more complex than this. 1363 01:08:56,388 --> 01:09:00,216 There are many different kinds of biological neurons, 1364 01:09:00,216 --> 01:09:01,352 the dendrites can perform 1365 01:09:01,352 --> 01:09:04,951 really complex non-linear computations, 1366 01:09:04,951 --> 01:09:07,884 and our synapses, right, the W-zeros that we had earlier 1367 01:09:07,884 --> 01:09:11,381 where we drew this analogy, are not single weights 1368 01:09:11,381 --> 01:09:14,033 like we had, they're actually really complex 1369 01:09:14,033 --> 01:09:17,087 non-linear dynamical systems in practice, 1370 01:09:17,087 --> 01:09:21,602 and also this idea of interpreting our activation function 1371 01:09:21,602 --> 01:09:25,524 as a sort of rate code or firing rate 1372 01:09:25,524 --> 01:09:28,522 is insufficient in practice, you know. 1373 01:09:28,523 --> 01:09:32,251 This kind of firing rate is probably not 1374 01:09:32,251 --> 01:09:34,671 a sufficient model of how neurons 1375 01:09:34,671 --> 01:09:37,214 will actually communicate to downstream neurons, right, 1376 01:09:37,215 --> 01:09:41,366 like, even in a very simple sense, 1377 01:09:41,366 --> 01:09:44,549 the neurons will fire at a variable rate, 1378 01:09:44,549 --> 01:09:47,843 and this variability probably should be taken into account. 1379 01:09:47,843 --> 01:09:49,913 And so, you know, 1380 01:09:49,913 --> 01:09:53,357 it's kind of a much more complex thing 1381 01:09:53,358 --> 01:09:56,226 than what we're dealing with. 1382 01:09:56,226 --> 01:10:00,692 There are references, for example on dendritic computation, 1383 01:10:00,692 --> 01:10:03,944 that you can look at if you're interested in this topic, 1384 01:10:03,944 --> 01:10:06,374 but yeah, so in practice, you know, 1385 01:10:06,374 --> 01:10:08,701 we can sort of see how this may resemble a neuron 1386 01:10:08,701 --> 01:10:11,883 at a very high level, but neurons are, 1387 01:10:11,883 --> 01:10:15,916 in practice, much more complicated than that. 1388 01:10:15,916 --> 01:10:18,998 Okay, so we talked about how there are many different kinds 1389 01:10:18,998 --> 01:10:21,118 of activation functions that could be used, 1390 01:10:21,118 --> 01:10:24,065 there's the ReLU that I mentioned earlier, 1391 01:10:24,065 --> 01:10:25,820 and we'll talk about all of these different kinds 1392 01:10:25,820 --> 01:10:29,828 of activation functions in much more detail later on, 1393 01:10:29,828 --> 01:10:31,868 the choices of activation functions 1394 01:10:31,868 --> 01:10:34,474 that you might want to use. 1395 01:10:34,474 --> 01:10:36,140 And so we'll also talk about different kinds 1396 01:10:36,140 --> 01:10:38,908 of neural network architectures.
1397 01:10:38,908 --> 01:10:42,825 So we gave the example of these fully-connected 1398 01:10:44,362 --> 01:10:46,243 neural networks, right, where each layer is 1399 01:10:46,243 --> 01:10:50,410 this matrix multiply, and so the way we actually want 1400 01:10:52,362 --> 01:10:55,969 to call these: we said two-layer neural network before, 1401 01:10:55,969 --> 01:10:58,535 and that corresponded to the fact that we have two 1402 01:10:58,535 --> 01:11:01,356 of these linear layers, right, where we're doing 1403 01:11:01,356 --> 01:11:04,374 a matrix multiply; two fully-connected layers 1404 01:11:04,374 --> 01:11:06,507 is what we call these. 1405 01:11:06,507 --> 01:11:09,183 We could also call this a one-hidden-layer neural network, 1406 01:11:09,183 --> 01:11:10,384 so instead of counting the number 1407 01:11:10,384 --> 01:11:13,186 of matrix multiplies we're doing, 1408 01:11:13,186 --> 01:11:15,519 counting the number of hidden layers that we have. 1409 01:11:15,519 --> 01:11:17,654 You can use either, 1410 01:11:17,654 --> 01:11:19,011 but I think two-layer neural network 1411 01:11:19,011 --> 01:11:21,918 is something that's a little more commonly used. 1412 01:11:21,918 --> 01:11:24,033 And then also here, our three-layer neural network 1413 01:11:24,033 --> 01:11:27,319 can also be called 1414 01:11:27,319 --> 01:11:30,152 a two-hidden-layer neural network. 1415 01:11:32,770 --> 01:11:36,759 And so we saw that, you know, when we're doing 1416 01:11:36,759 --> 01:11:38,425 this type of feed-forward, right, 1417 01:11:38,425 --> 01:11:41,330 forward pass through a neural network, 1418 01:11:41,330 --> 01:11:44,247 each of these nodes in this network 1419 01:11:46,768 --> 01:11:50,254 is basically doing the kind of operation of 1420 01:11:50,254 --> 01:11:53,587 the neuron that I showed earlier, right? 1421 01:11:55,875 --> 01:11:57,983 And so what's actually happening 1422 01:11:57,983 --> 01:12:00,496 is basically that each hidden layer you can think of 1423 01:12:00,496 --> 01:12:03,777 as a whole vector, right, a set of these neurons, 1424 01:12:03,777 --> 01:12:06,113 and so by writing it out this way 1425 01:12:06,113 --> 01:12:10,828 with these matrix multiplies to compute our neuron values, 1426 01:12:10,828 --> 01:12:13,163 we can efficiently evaluate 1427 01:12:13,163 --> 01:12:14,771 this entire layer of neurons, right? 1428 01:12:14,771 --> 01:12:18,554 So with one matrix multiply we get the output values 1429 01:12:18,554 --> 01:12:21,664 of a layer of, let's say, 10, 1430 01:12:21,664 --> 01:12:23,664 or 50, or 100 neurons. 1431 01:12:26,389 --> 01:12:31,082 All right, and so looking at this again, writing this 1432 01:12:31,082 --> 01:12:34,597 all out in matrix-vector form, 1433 01:12:34,597 --> 01:12:38,179 we have our non-linearity here, 1434 01:12:38,179 --> 01:12:40,743 f, that we're using, in this case a sigmoid function, right, 1435 01:12:40,743 --> 01:12:44,051 and we can take our data x, some input vector 1436 01:12:44,051 --> 01:12:48,226 of values, and we can apply our first matrix multiply, 1437 01:12:48,226 --> 01:12:51,976 W-one, on top of this, then our non-linearity, 1438 01:12:52,985 --> 01:12:54,436 then a second matrix multiply to get 1439 01:12:54,436 --> 01:12:56,128 a second hidden layer, h-two, 1440 01:12:56,128 --> 01:12:59,183 and then we have our final output, right?
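In code, that matrix-vector form is only a few lines. A minimal sketch, assuming a sigmoid non-linearity f and randomly initialized weights with illustrative sizes (bias terms omitted for brevity):

    import numpy as np

    f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid)

    x = np.random.randn(3, 1)                # input vector, 3-dimensional here
    W1 = np.random.randn(4, 3)               # first layer weights
    W2 = np.random.randn(4, 4)               # second layer weights
    W3 = np.random.randn(1, 4)               # output layer weights

    h1 = f(np.dot(W1, x))                    # first hidden layer
    h2 = f(np.dot(W2, h1))                   # second hidden layer
    out = np.dot(W3, h2)                     # final output scores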
1441 01:12:59,183 --> 01:13:02,231 And so, you know, this is basically all you need 1442 01:13:02,231 --> 01:13:05,289 to be able to write a neural network, 1443 01:13:05,289 --> 01:13:08,369 and, as we saw earlier, for the backward pass 1444 01:13:08,369 --> 01:13:11,199 you then just use backprop to compute all of the gradients, 1445 01:13:11,199 --> 01:13:14,199 and so that's basically all there is 1446 01:13:15,289 --> 01:13:19,456 to the main idea of what a neural network is. 1447 01:13:21,451 --> 01:13:24,115 Okay, so just to summarize: we talked about 1448 01:13:24,115 --> 01:13:28,454 how we could arrange neurons into these computations, right, 1449 01:13:28,454 --> 01:13:32,043 these fully-connected or linear layers. 1450 01:13:32,043 --> 01:13:34,642 This abstraction of a layer has the nice property 1451 01:13:34,642 --> 01:13:36,726 that we can use very efficient vectorized code 1452 01:13:36,726 --> 01:13:38,604 to compute all of these. 1453 01:13:38,604 --> 01:13:40,601 We also talked about how it's important 1454 01:13:40,601 --> 01:13:42,628 to keep in mind that neural networks 1455 01:13:42,628 --> 01:13:45,805 do have some, you know, analogy and loose inspiration 1456 01:13:45,805 --> 01:13:48,714 from biology, but they're not really neural. 1457 01:13:48,714 --> 01:13:50,692 I mean, this is a pretty loose analogy 1458 01:13:50,692 --> 01:13:53,035 that we're making, and next time we'll talk 1459 01:13:53,035 --> 01:13:56,319 about convolutional neural networks. 1460 01:13:56,319 --> 00:00:00,000 Okay, thanks.